Despite the success of fully supervised modeling of human skeleton sequences, learning skeleton sequence representations with self-supervised pre-training remains an active area, because task-specific skeleton annotations are difficult to acquire at large scale. Recent studies focus on learning video-level temporal and discriminative information with contrastive learning, but overlook the hierarchical spatio-temporal nature of human skeletons. Different from such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS) to explicitly capture spatial, short-term, and long-term temporal dependencies at the frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments covering three skeleton-based downstream tasks, including action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. In addition, we demonstrate that the prior knowledge learned by our model in the pre-training stage has strong transfer capability to downstream tasks.
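Below is a minimal PyTorch sketch of what such a three-level (frame / clip / video) hierarchical Transformer encoder for skeleton sequences can look like. The class name, feature dimensions, clip length, and mean-pooling choices are illustrative assumptions, not the released Hi-TRS implementation.

```python
# A minimal sketch (not the authors' Hi-TRS code) of a three-level hierarchical
# Transformer encoder for skeleton sequences. Dimensions, clip length, and
# pooling are illustrative assumptions.
import torch
import torch.nn as nn

def encoder(dim, heads=4, layers=2):
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class HierarchicalSkeletonEncoder(nn.Module):
    def __init__(self, joint_dim=3, dim=64, clip_len=4):
        super().__init__()
        self.proj = nn.Linear(joint_dim, dim)   # per-joint embedding
        self.frame_enc = encoder(dim)           # spatial attention over joints within a frame
        self.clip_enc = encoder(dim)            # short-term attention over frames within a clip
        self.video_enc = encoder(dim)           # long-term attention over clips within a video
        self.clip_len = clip_len

    def forward(self, x):                       # x: (B, T, J, joint_dim)
        B, T, J, _ = x.shape
        h = self.proj(x)                                           # (B, T, J, D)
        h = self.frame_enc(h.reshape(B * T, J, -1)).mean(dim=1)    # frame tokens
        h = h.reshape(B, T, -1)
        C = T // self.clip_len                                     # number of clips
        h = h[:, :C * self.clip_len].reshape(B * C, self.clip_len, -1)
        h = self.clip_enc(h).mean(dim=1).reshape(B, C, -1)         # clip tokens
        return self.video_enc(h).mean(dim=1)                       # video embedding (B, D)

# Example: a 32-frame sequence of 25 joints with 3-D coordinates.
feats = HierarchicalSkeletonEncoder()(torch.randn(2, 32, 25, 3))   # -> (2, 64)
```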
Combining information from multi-view images is crucial for improving the performance and robustness of automated disease-diagnosis methods. However, due to the non-aligned nature of multi-view images, building cross-view correlations and fusing the data largely remains an open problem. In this study, we propose TransFusion, a Transformer-based architecture that merges divergent multi-view imaging information using convolutional layers and powerful attention mechanisms. In particular, a Divergent Fusion Attention (DiFA) module is proposed for rich cross-view context modeling and semantic dependency mining, addressing the critical issue of capturing long-range correlations between unaligned data from different image views. We further propose Multi-Scale Attention (MSA) to collect global correspondences across multi-scale feature representations. We evaluate TransFusion on multi-disease, multi-view & multi-center right ventricular segmentation in the cardiac MRI (M&Ms-2) challenge cohort. TransFusion demonstrates leading performance against state-of-the-art methods and opens new perspectives on multi-view imaging integration for robust medical image segmentation.
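As an illustration of the kind of cross-view fusion described above, the hedged PyTorch sketch below lets tokens from one view attend to tokens from the other view and vice versa. The module name, shapes, and residual/normalization choices are assumptions rather than the published DiFA implementation.

```python
# A minimal sketch of cross-view attention fusion in the spirit of a DiFA-style
# module; not the published TransFusion code.
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, feat_a, feat_b):
        # feat_a: (B, Na, D) tokens from view A (e.g. short-axis slices)
        # feat_b: (B, Nb, D) tokens from view B (e.g. long-axis slices)
        fused_a, _ = self.attn_a(feat_a, feat_b, feat_b)   # view A attends to view B
        fused_b, _ = self.attn_b(feat_b, feat_a, feat_a)   # view B attends to view A
        return self.norm_a(feat_a + fused_a), self.norm_b(feat_b + fused_b)

a, b = CrossViewFusion()(torch.randn(1, 196, 128), torch.randn(1, 49, 128))
```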
The top-down instance segmentation framework has shown superiority in object detection over the bottom-up framework. While it effectively addresses over-segmentation, top-down instance segmentation suffers from the over-crop problem. A complete segmentation mask, however, is crucial for biological image analysis, since it carries important morphological properties such as shape and volume. In this paper, we propose a region proposal rectification (RPR) module to address this challenging segmentation problem. In particular, we offer a progressive ROIAlign module that gradually introduces neighbor information into a series of ROIs. The ROI features are fed into a dedicated feed-forward network (FFN) for proposal box regression. With the additional neighbor information, the proposed RPR module shows significant improvement in correcting region proposal locations and thereby achieves favorable instance segmentation performance on three biological image datasets compared with state-of-the-art baseline methods. Experimental results demonstrate that the proposed RPR module is effective in both anchor-based and anchor-free top-down instance segmentation approaches, suggesting that the method can be applied to general top-down instance segmentation of biological images. Code is available.
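The sketch below illustrates the underlying idea in a hedged way: pool features from progressively enlarged ROIs so that neighbor context is included, then regress a box correction with a small feed-forward network. The enlargement factors, shapes, and additive correction are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of proposal rectification via progressively enlarged ROIs.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ProposalRectifier(nn.Module):
    def __init__(self, channels=256, pool=7, scales=(1.0, 1.3, 1.6)):
        super().__init__()
        self.scales = scales                           # progressive ROI enlargement factors
        in_dim = channels * pool * pool * len(scales)
        self.ffn = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 4))    # (dx1, dy1, dx2, dy2) correction
        self.pool = pool

    def enlarge(self, boxes, s):                       # boxes: (N, 4) as x1, y1, x2, y2
        cx, cy = (boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2
        w, h = (boxes[:, 2] - boxes[:, 0]) * s, (boxes[:, 3] - boxes[:, 1]) * s
        return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)

    def forward(self, feat, boxes):                    # feat: (1, C, H, W), boxes: (N, 4)
        rois = [roi_align(feat, [self.enlarge(boxes, s)], self.pool)   # (N, C, p, p)
                for s in self.scales]
        x = torch.cat([r.flatten(1) for r in rois], dim=1)
        return boxes + self.ffn(x)                     # rectified proposals

new_boxes = ProposalRectifier()(torch.randn(1, 256, 64, 64),
                                torch.tensor([[10., 10., 30., 28.]]))
```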
Unmanned aerial vehicle (UAV) swarms are considered a promising technique for next-generation communication networks due to their flexibility, mobility, low cost, and ability to collaboratively and autonomously provide services. Distributed learning (DL) enables UAV swarms to intelligently provide communication services, multi-directional remote surveillance, and target tracking. In this survey, we first introduce several popular DL algorithms such as federated learning (FL), multi-agent reinforcement learning (MARL), distributed inference, and split learning, and present a comprehensive overview of their applications for UAV swarms, such as trajectory design, power control, wireless resource allocation, user assignment, perception, and satellite communications. Then, we present several state-of-the-art applications of UAV swarms in wireless communication systems, such as reconfigurable intelligent surfaces (RIS), virtual reality (VR), and semantic communications, and discuss the problems and challenges that DL-enabled UAV swarms can solve in these applications. Finally, we describe open problems of using DL in UAV swarms and future research directions for DL-enabled UAV swarms. In summary, this paper provides a comprehensive survey of DL applications for UAV swarms across a wide range of scenarios.
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
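The grounding idea above can be sketched as a symmetric contrastive alignment between object-query embeddings and embeddings of object nouns extracted from the caption. The matching scheme and temperature below are assumptions for illustration, not the exact CGG grounding loss.

```python
# A minimal sketch of a caption-grounding style alignment loss (assumed design,
# not the exact CGG loss): matched query/noun pairs sit on the diagonal.
import torch
import torch.nn.functional as F

def grounding_loss(query_emb, noun_emb, temperature=0.07):
    """query_emb: (N, D) mask/query features; noun_emb: (N, D) matched caption-noun
    features, where row i of both tensors refers to the same object."""
    q = F.normalize(query_emb, dim=-1)
    n = F.normalize(noun_emb, dim=-1)
    logits = q @ n.t() / temperature                   # (N, N) similarity matrix
    target = torch.arange(q.size(0), device=q.device)  # diagonal entries are positives
    return (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target)) / 2

loss = grounding_loss(torch.randn(8, 256), torch.randn(8, 256))
```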
Recent studies have shown that using an external Language Model (LM) benefits end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. Long-tail prediction problems have been widely studied in many applications, but have only been addressed by a few studies for ASR and LMs. In this paper, we propose a new memory-augmented, lookup-dictionary-based Transformer architecture for the LM. The newly introduced lookup dictionary incorporates rich contextual information from the training set, which is vital for correctly predicting long-tail tokens. With intensive experiments on Chinese and English data sets, our proposed method is shown to outperform the baseline Transformer LM by a large margin on both word/character error rate and tail-token error rate, without any impact on decoding efficiency. Overall, we demonstrate the effectiveness of the proposed method in boosting ASR decoding performance, especially for long-tail tokens.
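One common way to realize such a lookup memory, sketched below in the spirit of kNN-LM style retrieval (an assumption, not the paper's exact dictionary design), stores contextual hidden states together with their target tokens and interpolates the retrieved distribution with the LM's output.

```python
# A hedged sketch of a lookup memory for a Transformer LM: store hidden states
# with their next tokens, retrieve nearest neighbors at inference time, and
# interpolate with the LM distribution. Illustrative only.
import torch
import torch.nn.functional as F

class LookupMemory:
    def __init__(self):
        self.keys, self.values = [], []                # hidden states and their next tokens

    def add(self, hidden, next_tokens):
        self.keys.append(hidden.detach())
        self.values.append(next_tokens)

    def query(self, hidden, vocab_size, k=8, temperature=1.0):
        keys = torch.cat(self.keys)                    # (M, D)
        vals = torch.cat(self.values)                  # (M,)
        dist = torch.cdist(hidden, keys)               # (B, M) L2 distances
        knn_d, knn_i = dist.topk(k, largest=False)
        weights = F.softmax(-knn_d / temperature, dim=-1)           # (B, k)
        probs = torch.zeros(hidden.size(0), vocab_size)
        probs.scatter_add_(1, vals[knn_i], weights)    # accumulate weight per retrieved token
        return probs

def interpolate(lm_logits, memory_probs, lam=0.3):
    return (1 - lam) * F.softmax(lm_logits, dim=-1) + lam * memory_probs

mem = LookupMemory()
mem.add(torch.randn(100, 64), torch.randint(0, 1000, (100,)))
p = interpolate(torch.randn(4, 1000), mem.query(torch.randn(4, 64), vocab_size=1000))
```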
It is crucial to evaluate the quality of clusterings and determine the optimal number of clusters in cluster analysis. In this paper, a multi-granularity characterization of the data set is carried out to obtain hyper-balls, and a cluster internal evaluation index based on hyper-balls (HCVI) is defined. Moreover, a general method for determining the optimal number of clusters based on HCVI is proposed. The proposed method can evaluate the clustering results produced by several classic methods and determine the optimal cluster number for data sets containing noise and clusters with arbitrary shapes. Experimental results on synthetic and real data sets indicate that the new index outperforms existing ones.
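The abstract does not spell out the HCVI formula, so the sketch below only illustrates the surrounding workflow it plugs into: score clusterings over a range of k with an internal validity index and keep the best k. The silhouette score is used purely as a stand-in for HCVI.

```python
# Generic "pick the best k by an internal index" workflow; silhouette_score is
# a stand-in for HCVI, which is not defined in the abstract.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_cluster_number(X, k_range=range(2, 11)):
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)        # replace with HCVI in practice
    return max(scores, key=scores.get), scores

X = np.random.rand(300, 2)
k_opt, _ = best_cluster_number(X)
```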
Generalizability to unseen forgery types is crucial for face forgery detectors. Recent works have made significant progress on generalization through synthetic forgery data augmentation. In this work, we explore another path for improving generalization. Our goal is to reduce the features that are easy to learn in the training phase, so as to reduce the risk of overfitting to specific forgery types. Specifically, in our method, a teacher network takes the face images as input and generates an attention map over the deep features with a diverse multi-head attention ViT. The attention map is used to guide a student network to focus on the low-attended features by suppressing the highly attended deep features. A deep feature mixup strategy is also proposed to synthesize forgeries in the feature domain. Experiments demonstrate that, without data augmentation, our method achieves promising performance on unseen forgeries and highly compressed data.
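Under assumed shapes, the two ingredients can be sketched as (1) masking the student's highly attended feature positions using the teacher's attention map and (2) mixing deep features of real and fake samples. The keep ratio and Beta mixing below are assumptions, not the paper's settings.

```python
# A minimal sketch of attention-guided feature suppression and feature mixup.
import torch

def suppress_high_attention(student_feat, teacher_attn, keep_ratio=0.7):
    # student_feat: (B, C, H, W); teacher_attn: (B, 1, H, W) in [0, 1]
    flat = teacher_attn.flatten(1)
    thresh = flat.quantile(keep_ratio, dim=1).view(-1, 1, 1, 1)
    mask = (teacher_attn < thresh).float()             # keep only low-attended positions
    return student_feat * mask

def feature_mixup(real_feat, fake_feat, alpha=0.5):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * real_feat + (1 - lam) * fake_feat     # synthesized forgery in feature space

feat = suppress_high_attention(torch.randn(2, 64, 14, 14), torch.rand(2, 1, 14, 14))
mixed = feature_mixup(torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14))
```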
This paper presents a novel framework for planning in unknown and occluded urban spaces. We specifically focus on turns and intersections where occlusions significantly impact navigability. Our approach uses an inpainting model to fill in a sparse, occluded, semantic lidar point cloud and plans dynamically feasible paths for a vehicle to traverse through the open and inpainted spaces. We demonstrate our approach using a car's lidar data with real-time occlusions, and show that by inpainting occluded areas we can plan longer paths with more turn options than without inpainting; in addition, our approach follows paths derived from a planner with no occlusions (the ground truth) more closely than other state-of-the-art approaches.
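A deliberately simplified sketch of the pipeline shape follows: occluded cells of a 2-D occupancy grid are filled by a (stubbed) inpainting model and a path is then searched over the resulting free space. The real system inpaints a semantic lidar point cloud and plans dynamically feasible vehicle paths; the stub and grid BFS below are only stand-ins.

```python
# Simplified inpaint-then-plan pipeline; the inpainting model and planner are stubs.
import numpy as np
from collections import deque

def inpaint(grid, occluded):
    # Stand-in for a learned inpainting model: here, occluded cells are marked free.
    filled = grid.copy()
    filled[occluded] = 0
    return filled

def bfs_path(grid, start, goal):
    q, prev = deque([start]), {start: None}
    while q:
        cur = q.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        for d in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + d[0], cur[1] + d[1])
            if (0 <= nxt[0] < grid.shape[0] and 0 <= nxt[1] < grid.shape[1]
                    and grid[nxt] == 0 and nxt not in prev):
                prev[nxt] = cur
                q.append(nxt)
    return None

grid = np.zeros((20, 20), dtype=int)        # 0 = free, 1 = obstacle
occluded = np.zeros_like(grid, dtype=bool)  # mask of occluded cells
path = bfs_path(inpaint(grid, occluded), (0, 0), (19, 19))
```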
In this work, we investigate improving the generalizability of GAN-generated image detectors by performing data augmentation in the fingerprint domain. Specifically, we first separate the fingerprints and contents of GAN-generated images using an autoencoder-based GAN fingerprint extractor, followed by random perturbations of the fingerprints. The original fingerprints are then substituted with the perturbed fingerprints and added back to the original contents, producing images that are visually invariant but carry distinct fingerprints. The perturbed images can successfully imitate images generated by different GANs, improving the generalization of the detectors, as demonstrated by the spectrum visualizations. To our knowledge, we are the first to conduct data augmentation in the fingerprint domain. Our work explores a novel prospect distinct from previous works on spatial- and frequency-domain augmentation. Extensive cross-GAN experiments demonstrate the effectiveness of our method compared to state-of-the-art methods in detecting fake images generated by unknown GANs.
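The augmentation idea can be sketched under assumed interfaces: an autoencoder reconstructs the image "content", the residual is treated as the GAN fingerprint, the fingerprint is randomly perturbed, and the result is recombined with the content. The tiny untrained autoencoder and noise level below are assumptions, not the paper's trained extractor.

```python
# A minimal sketch of fingerprint-domain augmentation; autoencoder and noise
# strength are illustrative assumptions.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                                 nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1))

    def forward(self, x):
        return self.dec(self.enc(x))

def augment_in_fingerprint_domain(images, autoencoder, noise_std=0.05):
    with torch.no_grad():
        content = autoencoder(images)                  # low-frequency "content" estimate
    fingerprint = images - content                     # residual treated as the fingerprint
    perturbed = fingerprint + noise_std * torch.randn_like(fingerprint)
    return content + perturbed                         # visually similar, new fingerprint

images = torch.rand(2, 3, 64, 64)
augmented = augment_in_fingerprint_domain(images, TinyAutoencoder())
```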